Abstract

Under model misspecification, the MLE generally converges to the pseudo-true parameter, 
the parameter corresponding to the distribution within the model that is closest to the dis- 
tribution from which the data are sampled. In many problems, the pseudo-true parameter 
corresponds to a population parameter of interest, and so a misspecified model can provide 
consistent estimation for this parameter. Furthermore, the well-known sandwich variance formula
of Huber (1967) provides an asymptotically accurate sampling distribution for the MLE,
even under model misspecification. However, confidence intervals based on a sandwich variance 
estimate may behave poorly for low sample sizes, partly due to the use of a plug-in estimate of 
the variance. From a Bayesian perspective, plug-in estimates of nuisance parameters generally 
underrepresent uncertainty in the unknown parameters, and averaging over such parameters is 
expected to give better performance. With this in mind, we present a Bayesian sandwich poste- 
rior distribution, whose likelihood is based on the sandwich sampling distribution of the MLE. 
This Bayesian approach allows for the incorporation of prior information about the parameter 
of interest, averages over uncertainty in the nuisance parameter and is asymptotically robust 
to model misspecification. In a small simulation study on estimating a regression parameter 
under heteroscedasticity, the addition of accurate prior information and the averaging over the 
nuisance parameter are both seen to improve the accuracy and calibration of confidence intervals 
for the parameter of interest. 

Keywords: estimating equations, exponential family, model misspecification, pivotal quantity. 



This note is part of a discussion of "Bayesian inference with misspecified models" by Stephen Walker. Replication
code for the simulation study is available at the first author's website: www.stat.washington.edu/~hoff
1 Introduction



Let $X$ be the data resulting from an experiment, survey or random process, and let $\theta$ denote
some fixed but unknown aspect of the data generating process. Before the experiment is run,
both $X$ and $\theta$ are uncertain. A subjective Bayesian uses probability to represent pre-experimental
uncertainty in both $X$ and $\theta$, and Bayes' rule to represent uncertainty in $\theta$ after having observed $X$.
One appealing aspect of the subjective Bayesian approach is that it is an internally consistent and
rational way to update information. If $\mathcal{P}_\Theta = \{p(X|\theta) : \theta \in \Theta\}$ expresses our beliefs about $X$ given
$\theta$, and $\pi(\theta)$ expresses our beliefs about $\theta$, then $\pi(\theta|X) \propto \pi(\theta)p(X|\theta)$ expresses what we should
believe about $\theta$, having observed $X$. For $\pi(\theta|X)$ to be of most use, both $p(X|\theta)$ and $\pi(\theta)$ should
actually represent our beliefs, at least approximately. Professor Walker's paper (Walker, 2013)
highlights the problem that in practice, a statistical model $\mathcal{P}_\Theta$ is often used that is known to not
represent beliefs, in that it is suspected that $\mathcal{P}_\Theta$ does not include the distribution that generated
the data. In such cases, interpretation of $\pi(\theta|X)$ may be problematic: Not only is the validity of
$\pi(\theta|X)$ as a probabilistic description of information about $\theta$ questionable, it is not even clear
that $\theta$ represents anything of interest.

One remedy discussed by Walker is to expand the model so that $\mathcal{P}_\Theta$ can be assumed to contain
the correct data generating process, or at least something very close to it. Depending on what the
data are, this can make the model quite large. Walker focuses on the situation where the data are
taken to be a sample of observations from a population, i.e. $X = \{x_1, \ldots, x_n\}$. To guarantee that
the model is not misspecified, $\mathcal{P}_\Theta$ must be quite large, essentially covering (in a topological sense)
the space of all probability distributions. However, addressing the model misspecification problem
in this way can complicate the other component of subjective Bayesian inference: specification of
the prior distribution. The larger the model is, the more difficult it will be to specify a prior that
represents actual beliefs about the unknown population. If $\pi(\theta)$ does not represent prior beliefs,
then the use of $\pi(\theta|X)$ as an expression of posterior beliefs is questionable, except possibly when
the sample size is very large.



2 Incorrect models with correct pseudo-true parameters 

If we wish to benefit from the internal consistency of subjective Bayesian inference, we need to limit
our probability statements to those quantities about which we have actual information. As a very
simple example, suppose we have a sample $x_1, \ldots, x_n$ of independent measurements for which the
measurement error variance $\sigma^2$ is known. If we have prior information $\pi(\theta)$ about the population mean $\theta$, but
not any other aspect of the population (other than $\sigma^2$), then we should limit our data $X$ to quantities
whose sampling distribution depends only on $\theta$ and $\sigma^2$. This condition will be approximately met
by the sample mean $\bar{x}$, whose sampling distribution is approximately normal. A limited form of
subjective Bayesian inference can proceed via the posterior density $\pi(\theta|\bar{x}) \propto \pi(\theta) \times p(\bar{x}|\theta)$, where
the latter density is that of a $N(\theta, \sigma^2/n)$ random variable.

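Strictly speaking, the model $p(\bar{x}|\theta)$ is misspecified unless the data are sampled from a normal
population. As Walker (2013) asks, what does $\theta$ represent in the case of model misspecification?
Letting $p_0(\bar{x})$ be the true sampling distribution of $\bar{x}$, the pseudo-true parameter $\theta^*$ is given by

$$
\begin{aligned}
\theta^* &= \arg\min_\theta \int \log \frac{p_0(\bar{x})}{p(\bar{x}|\theta)} \, p_0(\bar{x}) \, d\bar{x} \\
&= \arg\min_\theta \int \tfrac{1}{2}\left[\log(2\pi\sigma^2/n) + n\bar{x}^2/\sigma^2 - 2n\bar{x}\theta/\sigma^2 + n\theta^2/\sigma^2\right] p_0(\bar{x}) \, d\bar{x} \\
&= \arg\min_\theta \left(\theta^2 - 2\theta E[\bar{x}]\right) = E[\bar{x}] = \theta,
\end{aligned}
$$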

and so in this case, the pseudo-true parameter is equal to the parameter of interest, regardless
of whether or not $p_0$ is in the model. Furthermore, the posterior distribution given by
$\pi(\theta|\bar{x}) \propto \pi(\theta) \times p(\bar{x}|\theta)$ provides (approximate) subjective Bayesian inference for the population mean $\theta$, even
if the population is not normal, and without having to quantify prior information about anything
but the first two population moments.
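The argument above can be checked numerically. The following is a minimal sketch, not part of the original note: it assumes an exponential(1) population (mean 1, variance 1), draws sample means from it, and minimizes the Monte Carlo average of $-\log p(\bar{x}|\theta)$ over a grid of $\theta$ values. The function names, the grid and the sample sizes are illustrative choices.

```python
import math
import random

def neg_log_density(xbar, theta, sigma2, n):
    """-log of the N(theta, sigma2 / n) density evaluated at xbar."""
    var = sigma2 / n
    return 0.5 * (math.log(2 * math.pi * var) + (xbar - theta) ** 2 / var)

def pseudo_true_on_grid(xbars, sigma2, n, grid):
    """Grid point minimising the Monte Carlo average of -log p(xbar | theta)."""
    def average_loss(theta):
        return sum(neg_log_density(xb, theta, sigma2, n) for xb in xbars) / len(xbars)
    return min(grid, key=average_loss)

rng = random.Random(1)
n, sigma2 = 5, 1.0  # assumed population: exponential(1), with mean 1 and variance 1
# Draws of the sample mean from the (non-normal) true sampling distribution p0.
xbars = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(4000)]
grid = [i / 100 for i in range(301)]  # candidate theta values in [0, 3]
theta_star = pseudo_true_on_grid(xbars, sigma2, n, grid)
mc_mean = sum(xbars) / len(xbars)  # Monte Carlo estimate of E[xbar]
```

Because the averaged loss is quadratic in $\theta$, `theta_star` should land on the grid point nearest to `mc_mean`, even though the draws of $\bar{x}$ are not normally distributed.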

Now suppose we are interested in estimating a collection of population moments $\lambda \in \mathbb{R}^p$, where
$\lambda_j = E[g_j(x)]$, $j = 1, \ldots, p$. Is there a parametric model $\{p(x|\theta) : \theta \in \Theta\}$ whose pseudo-true
parameter $\theta^*$ satisfies $E[g_j(x)|\theta^*] = \lambda_j$ for each $j = 1, \ldots, p$? Consider the exponential family with
sufficient statistics $\{g_1(x), \ldots, g_p(x)\}$ given by $p(x|\theta) = h(x)\exp\{\theta_1 g_1(x) + \cdots + \theta_p g_p(x) - c(\theta)\}$. The
pseudo-true parameter $\theta^*$ for such a model is given by



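$$
\begin{aligned}
\theta^* &= \arg\min_\theta \int \log \frac{p_0(x)}{p(x|\theta)} \, p_0(x) \, dx \\
&= \arg\max_\theta \int [\log p(x|\theta)] \, p_0(x) \, dx \\
&= \arg\max_\theta \int [\theta_1 g_1(x) + \cdots + \theta_p g_p(x)] \, p_0(x) \, dx - c(\theta),
\end{aligned}
$$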



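where $p_0(x)$ is the true population density. Taking derivatives with respect to each element of $\theta$
tells us that $\theta^*$ is the solution in $\theta$ to

$$
\frac{\partial}{\partial\theta_j} \int [\theta_1 g_1(x) + \cdots + \theta_p g_p(x)] \, p_0(x) \, dx = \frac{\partial}{\partial\theta_j} c(\theta), \qquad j = 1, \ldots, p.
$$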



The left-hand side is $\int g_j(x) p_0(x) \, dx = \lambda_j$, one of the moments we want to estimate. The right-hand
side is equal to $E[g_j(x)|\theta]$, due to the well-known identity for exponential families. Therefore,
$\theta^*$ is the parameter value such that $\int g_j(x) p(x|\theta^*) \, dx = \int g_j(x) p_0(x) \, dx$ for $j \in \{1, \ldots, p\}$. Thus
for an exponential family with sufficient statistic $\{g_1(x), \ldots, g_p(x)\}$, the pseudo-true parameter $\theta^*$
satisfies $E[g_j(x)|\theta^*] = E[g_j(x)]$, where the latter expectation is with respect to the true population
distribution.
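A small numerical illustration of this moment-matching property follows. It is a sketch under assumptions that are not in the original note: the working model is the normal family, written as an exponential family with sufficient statistics $(x, x^2)$, and the data come from an exponential(1) population, for which $E[x] = 1$ and $E[x^2] = 2$. The function names are hypothetical.

```python
import random

def fit_normal_family(xs):
    """MLE of the natural parameters of the family with sufficient statistics (x, x^2)."""
    m1 = sum(xs) / len(xs)
    m2 = sum(x * x for x in xs) / len(xs)
    var = m2 - m1 ** 2
    return (m1 / var, -0.5 / var)  # (theta_1, theta_2) of the N(mean, var) family

def model_moments(theta):
    """E[x | theta] and E[x^2 | theta] under the fitted normal model."""
    theta1, theta2 = theta
    var = -0.5 / theta2
    mean = theta1 * var
    return mean, var + mean ** 2

rng = random.Random(2)
xs = [rng.expovariate(1.0) for _ in range(100000)]  # assumed population, not normal
theta_hat = fit_normal_family(xs)
fitted = model_moments(theta_hat)  # moments implied by the fitted (wrong) model
sample = (sum(xs) / len(xs), sum(x * x for x in xs) / len(xs))  # empirical moments
```

The fitted model reproduces the empirical first and second moments exactly, so as the sample grows its moments approach those of the true population even though the population is far from normal.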
The result above suggests that some models can be used to make inference for certain aspects
of a population $P_0$, even if $P_0$ is not a member of the model. Specifically, a possibly incorrect
model $\{P_\theta : \theta \in \Theta\}$ can be used to obtain consistent estimators of those functionals of $P_0$ which
match those of $P_{\theta^*}$, where $\theta^*$ is the pseudo-true parameter. However, this does not ensure that the
model can correctly represent the sampling variability of such estimators, even asymptotically. As
a result, confidence intervals based on an incorrect model can be asymptotically invalid, even if the
incorrect model provides a consistent estimator. To address this concern, Huber (1967) derived the
limiting distribution of the MLE $\hat\theta$ of $\theta$ under a possibly incorrect model in terms of the pseudo-true
parameter. The approximation proceeds roughly as follows: Suppose $x_1, \ldots, x_n$ are i.i.d.
observations from population $P_0$, and let $l(\theta : x_i) = \log p(x_i|\theta)$ be the log-likelihood corresponding
to a single observation $x_i$, with first and second derivatives in $\theta$ written $\dot{l}$ and $\ddot{l}$. A first order
Taylor series expansion of $\sum_i \dot{l}(\theta^* : x_i)$ around the MLE $\hat\theta$, at which $\sum_i \dot{l}(\hat\theta : x_i) = 0$, gives

$$
\sum_{i=1}^n \dot{l}(\theta^* : x_i) \approx \left[\sum_{i=1}^n \ddot{l}(\hat\theta : x_i)\right] (\theta^* - \hat\theta).
$$




By the central limit theorem, the sum on the left-hand side is approximately $N(0, nB)$, where
$B = \mathrm{Var}[\dot{l}(\theta^* : x)]$ and the variance here is under $P_0$. Letting $A = \sum_{i=1}^n \ddot{l}(\hat\theta : x_i)$ be the sum on the
right-hand side, we have

$$
(\theta^* - \hat\theta) \sim N(0, nA^{-1}BA^{-1}), \qquad (1)
$$

where "$\sim$" means "approximately distributed as." This result has been used extensively to obtain
confidence intervals for the pseudo-true parameter $\theta^*$, in cases where it corresponds to a population
quantity of interest. In practice, since $\theta^*$ is unknown, $B$ is estimated as $\hat{B} = \sum_i \dot{l}(\hat\theta : x_i)\dot{l}(\hat\theta : x_i)^T/n$,
the sample variance of the derivatives of the log-likelihood functions at the MLE. The resulting variance
estimate $C = nA^{-1}\hat{B}A^{-1}$ is called the sandwich variance estimate for $(\theta^* - \hat\theta)$. Confidence intervals for $\theta^*$ can be
obtained by approximating the distribution of $C^{-1/2}(\theta^* - \hat\theta)$ by a $N(0, I)$ distribution. Sandwich
confidence intervals avoid the issue of model misspecification by positing the sampling distribution
of the pivotal quantity $C^{-1/2}(\theta^* - \hat\theta)$, rather than the sampling distribution of $x_1, \ldots, x_n$. The
model used to obtain the likelihoods $\{l(\theta : x_i), i = 1, \ldots, n\}$ is simply a tool that provides a
consistent estimate of the pseudo-true parameter $\theta^*$ and asymptotically correct confidence intervals;
see White (1980) and Royall (1986). Sandwich variance estimation has also been applied to inference based on
generalized estimating equations (GEE), a popular likelihood-free approach to inference (Liang and Zeger,
1986; Zeger and Liang, 1986; Gourieroux et al., 1983).
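As a worked scalar illustration of the sandwich variance estimate $C$ (a sketch under an assumed working model, not an example taken from the references above), suppose the working model for each observation is $N(\theta, \sigma^2)$ with $\sigma^2$ fixed at an arbitrary value. The quantities defined in this section then reduce to

$$
\begin{aligned}
\dot{l}(\theta : x_i) &= (x_i - \theta)/\sigma^2, \qquad \ddot{l}(\theta : x_i) = -1/\sigma^2, \qquad \hat\theta = \bar{x}, \\
A &= -n/\sigma^2, \qquad \hat{B} = \frac{1}{n\sigma^4}\sum_{i=1}^n (x_i - \bar{x})^2, \\
C &= nA^{-1}\hat{B}A^{-1} = n \cdot \frac{\sigma^4}{n^2} \cdot \frac{1}{n\sigma^4}\sum_{i=1}^n (x_i - \bar{x})^2 = \frac{1}{n^2}\sum_{i=1}^n (x_i - \bar{x})^2.
\end{aligned}
$$

The fixed value of $\sigma^2$ cancels, and $C$ is the empirical variance of the observations divided by $n$, whatever the shape of $P_0$.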
3 A Bayesian sandwich posterior distribution



While used extensively in practice, sandwich confidence intervals can behave poorly for low sample
sizes, with coverage often being well below their nominal level (Kauermann and Carroll, 2001).
One reason for this is that the sandwich procedure does not properly account for uncertainty in
the variance $B$ of $\dot{l}(\theta^* : x)$. The replacement of $B$ by $\hat{B}$ in fact uses two plug-in approximations:
the MLE $\hat\theta$ for $\theta^*$, and the sample covariance $\hat{B}$ for the population covariance $B$. Ignoring the
uncertainty in both of these approximations is likely to provide an underestimate of $B$, resulting
in overly narrow confidence intervals and below-nominal coverage rates.

One of the attractions of Bayesian inference is that uncertainty in nuisance parameters can
be accounted for by integrating over their possible values, rather than plugging in point estimates.
With this in mind, we propose the following version of a "Bayesian sandwich" posterior distribution,
quantifying the uncertainty in both $\theta^*$ and $B$: Given a working model $\mathcal{P}_\Theta = \{p(x|\theta) : \theta \in \Theta\}$ and
observations $x_1, \ldots, x_n \sim$ i.i.d. $P_0$, we form a likelihood derived from the approximate joint density
of the MLE $\hat\theta$ based on $\mathcal{P}_\Theta$, and the sum of squares of the derivatives of the log-likelihood functions
$S(\theta^*) = \sum_{i=1}^n \dot{l}(\theta^* : x_i)\dot{l}(\theta^* : x_i)^T$, giving us the following approximate likelihood function:

$$
\begin{aligned}
p(\hat\theta, S(\theta^*) | \theta^*, B) &= p(\hat\theta | \theta^*, B) \times p(S(\theta^*) | \hat\theta, \theta^*, B) \\
&\approx \mathrm{dnorm}(\hat\theta | \theta^*, nA^{-1}BA^{-1}) \times \mathrm{dWishart}(S(\theta^*) | n, B),
\end{aligned}
$$

where "dnorm" and "dWishart" refer to the normal and Wishart densities respectively. The validity
of this likelihood is based on three approximations. The first is the normal approximation to the
distribution of $\hat\theta$ given by (1). The second is the conditional independence of $S(\theta^*)$ and $\hat\theta$, and the
third is the approximation of the distribution of $S(\theta^*)$ with a Wishart distribution. The first of
these approximations is justified asymptotically by Huber (1967), whereas the latter two are, at
least currently, heuristic.

Based on this approximate likelihood and a prior distribution for $(\theta^*, B)$, a posterior distribution
can be obtained via MCMC in the usual way. For example, if the priors for $\theta^*$ and $B$
are normal$(m_0, V_0)$ and inverse-Wishart$(\nu_0, S_0^{-1})$ respectively, then posterior approximation can
proceed via the following Gibbs sampler: Given current values of $\theta^{*(s)}$ and $B^{(s)}$,

1. simulate $\theta^{*(s+1)} \sim N_p(m_1, V_1)$, where

   $V_1^{-1} = V_0^{-1} + AB^{-1}A/n$, $m_1 = V_1[V_0^{-1}m_0 + AB^{-1}A\hat\theta/n]$, with $B = B^{(s)}$;

2. simulate $B^{(s+1)} \sim$ inverse-Wishart$(\nu_1, S_1^{-1})$, where $\nu_1 = \nu_0 + n + 1$ and

   $S_1 = S_0 + S(\theta^{*(s+1)}) + A(\theta^{*(s+1)} - \hat\theta)(\theta^{*(s+1)} - \hat\theta)^T A/n$.
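The two steps above can be sketched in code for the scalar case $p = 1$. This is a minimal illustration rather than the authors' replication code, and it rests on several assumptions made here: the working model is $N(\theta, 1)$, so that $\hat\theta = \bar{x}$, $A = -n$ and $S(\theta^*) = \sum_i (x_i - \theta^*)^2$; for $p = 1$ the inverse-Wishart$(\nu_1, S_1^{-1})$ draw is taken as the reciprocal of a gamma variable with shape $\nu_1/2$ and scale $2/S_1$; and the function name and its arguments are hypothetical.

```python
import math
import random

def gibbs_sandwich_scalar(xs, n_iter=5000, m0=0.0, v0_inv=0.0, nu0=0.0, s0=0.0, seed=0):
    """Scalar (p = 1) Gibbs sampler for (theta*, B) under the working model N(theta, 1).

    v0_inv = 0 corresponds to a flat prior on theta*; nu0 = s0 = 0 corresponds to
    the improper prior pi(B) proportional to 1 / B.
    """
    rng = random.Random(seed)
    n = len(xs)
    theta_hat = sum(xs) / n                          # MLE under the working model
    a = -float(n)                                    # A = sum of second derivatives
    b = sum((x - theta_hat) ** 2 for x in xs) / n    # start B at the plug-in estimate
    draws = []
    for _ in range(n_iter):
        # Step 1: theta* | B ~ N(m1, V1), with V1^{-1} = V0^{-1} + A B^{-1} A / n.
        prec = v0_inv + a * a / (b * n)
        m1 = (v0_inv * m0 + a * a * theta_hat / (b * n)) / prec
        theta = rng.gauss(m1, math.sqrt(1.0 / prec))
        # Step 2: B | theta* is inverse-Wishart(nu1, 1 / S1); for p = 1, 1 / B is gamma.
        s_theta = sum((x - theta) ** 2 for x in xs)  # S(theta*)
        s1 = s0 + s_theta + a * a * (theta - theta_hat) ** 2 / n
        nu1 = nu0 + n + 1
        b = 1.0 / rng.gammavariate(nu1 / 2.0, 2.0 / s1)
        draws.append((theta, b))
    return draws

rng = random.Random(3)
data = [rng.expovariate(1.0) for _ in range(20)]     # assumed non-normal sample
draws = gibbs_sandwich_scalar(data, n_iter=4000, seed=4)
thetas = sorted(t for t, _ in draws[500:])           # discard 500 burn-in draws
interval = (thetas[int(0.025 * len(thetas))], thetas[int(0.975 * len(thetas))])
```

Here `interval` is an approximate 95% posterior interval for $\theta^*$ under the flat prior on $\theta^*$. An informative normal prior can be supplied through `m0` and `v0_inv`.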
The hyperparameters $(m_0, V_0)$ should ideally represent prior information about $\theta^*$. Information
upon which to base $(\nu_0, S_0^{-1})$ might be harder to come by. One possibility would be to use Jeffreys'
prior, $\pi(B) \propto |B|^{-(p+1)/2}$ (Geisser and Cornfield, 1963). The posterior distribution under this prior
can be approximated with the above Gibbs sampler by setting $\nu_0 = 0$ and $S_0$ equal to the $p \times p$
matrix of zeros.
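The update for $B$ in the Gibbs sampler can be checked directly. The following sketch collects the factors of the approximate posterior that involve $B$, writing $d = \theta^* - \hat\theta$ and $S = S(\theta^*)$ (shorthand introduced here), and assuming that the inverse-Wishart$(\nu_0, S_0^{-1})$ prior has density proportional to $|B|^{-(\nu_0+p+1)/2}\exp\{-\mathrm{tr}(S_0B^{-1})/2\}$:

$$
\begin{aligned}
\pi(B) &\propto |B|^{-(\nu_0+p+1)/2} \exp\{-\mathrm{tr}(S_0 B^{-1})/2\}, \\
\mathrm{dnorm}(\hat\theta | \theta^*, nA^{-1}BA^{-1}) &\propto |B|^{-1/2} \exp\{-\mathrm{tr}(A d d^T A B^{-1})/(2n)\}, \\
\mathrm{dWishart}(S | n, B) &\propto |B|^{-n/2} \exp\{-\mathrm{tr}(S B^{-1})/2\}, \\
\pi(B | \theta^*, \hat\theta, S) &\propto |B|^{-(\nu_0+n+1+p+1)/2} \exp\{-\mathrm{tr}([S_0 + S + A d d^T A/n] B^{-1})/2\}.
\end{aligned}
$$

The last line is an inverse-Wishart kernel with $\nu_1 = \nu_0 + n + 1$ and $S_1 = S_0 + S + Add^TA/n$. With $\nu_0 = 0$ and $S_0$ equal to the zero matrix, the prior factor reduces to $|B|^{-(p+1)/2}$, which is Jeffreys' prior.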



4 Example: Regression with heteroscedastic errors 

Suppose we have a sample $(y_1, x_1), \ldots, (y_n, x_n) \sim$ i.i.d. $P_0$, and wish to estimate the linear regression
of $y$ on $x$ that would be obtained from performing the regression on the entire population. In other
words, letting $\mathbf{x} = (1, x)$, we want to estimate $\beta = E[\mathbf{x}\mathbf{x}^T]^{-1}E[\mathbf{x}y]$, where both expectations are
under $P_0$. Consistent estimation of this quantity can be obtained from the normal regression model
$y_i = \beta^T\mathbf{x}_i + \epsilon_i$, $\epsilon_1, \ldots, \epsilon_n \sim$ i.i.d. $N(0, 1)$, even if $P_0$ is not in this model, as the pseudo-true
parameter of this regression model is equal to $E[\mathbf{x}\mathbf{x}^T]^{-1}E[\mathbf{x}y]$, the population regression parameter
under $P_0$. We also note that the variance of the error terms in the regression model can be taken to
be any fixed value: Whichever value is specified will end up canceling out in the sandwich variance
calculation.

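Using the regression model as our working model, we have $\dot{l}(\beta : y, \mathbf{x}) = \mathbf{x}(y - \beta^T\mathbf{x})$ and
$\ddot{l}(\beta : y, \mathbf{x}) = -\mathbf{x}\mathbf{x}^T$, giving

$$
A = -\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^T \quad \text{and} \quad S(\beta) = \sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^T (y_i - \beta^T\mathbf{x}_i)^2.
$$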

The usual sandwich variance estimate of the MLE $\hat\beta$ under the normal regression model is $nA^{-1}\hat{B}A^{-1}$,
where $\hat{B} = S(\hat\beta)/n$. In contrast, the proposed Bayesian sandwich posterior distribution infers $B$
jointly with $\beta$, based on the Wishart model for $S(\beta)$. To compare the performance of the proposed
Bayesian sandwich posterior to the usual sandwich procedure, we ran a small simulation study in
order to calculate coverage rates and average interval widths of nominal 95% confidence intervals.
For both small ($n = 10$) and large ($n = 500$) sample sizes, datasets were generated as $x_1, \ldots, x_n \sim$
i.i.d. exponential(1), and $y_i | x_i \sim N(\beta_1 + \beta_2 x_i, (\beta_1 + \beta_2 x_i)^2)$, where $\beta_1 = \beta_2 = 1$. Thus the working
model incorrectly assumes homoscedastic errors, whereas the true population has substantial
heteroscedasticity.
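The plug-in sandwich calculation for this regression can be sketched as follows. This is an illustrative implementation using only the Python standard library, not the authors' replication code. It assumes the data-generating mechanism $y_i | x_i \sim N(1 + x_i, (1 + x_i)^2)$ with exponential(1) covariates, and the helper names `inv2`, `matmul2` and `sandwich_regression` are hypothetical.

```python
import random

def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def matmul2(m, k):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(m[i][t] * k[t][j] for t in range(2)) for j in range(2))
        for i in range(2)
    )

def sandwich_regression(xs, ys):
    """Least squares estimate and sandwich variance n A^{-1} B_hat A^{-1} for y on (1, x)."""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # -A = sum_i x_i x_i^T with x_i = (1, x_i); the sign of A cancels in A^{-1} B A^{-1}.
    neg_a_inv = inv2(((n, sx), (sx, sxx)))
    beta = (neg_a_inv[0][0] * sy + neg_a_inv[0][1] * sxy,
            neg_a_inv[1][0] * sy + neg_a_inv[1][1] * sxy)
    # S(beta_hat) = sum_i x_i x_i^T e_i^2 and B_hat = S(beta_hat) / n.
    e2 = [(y - beta[0] - beta[1] * x) ** 2 for x, y in zip(xs, ys)]
    s00 = sum(e2)
    s01 = sum(x * e for x, e in zip(xs, e2))
    s11 = sum(x * x * e for x, e in zip(xs, e2))
    b_hat = ((s00 / n, s01 / n), (s01 / n, s11 / n))
    c = matmul2(matmul2(neg_a_inv, b_hat), neg_a_inv)
    return beta, tuple(tuple(n * v for v in row) for row in c)

rng = random.Random(5)
n = 20000
xs = [rng.expovariate(1.0) for _ in range(n)]
ys = [rng.gauss(1.0 + x, 1.0 + x) for x in xs]   # standard deviation grows with x
beta_hat, c = sandwich_regression(xs, ys)
slope_se = c[1][1] ** 0.5                         # sandwich standard error of the slope
```

With a large sample, `beta_hat` should be close to the population regression coefficients even though the working model assumes constant error variance, and `c` estimates the sampling variance of `beta_hat` without relying on that assumption.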

For each simulated dataset, we obtained Bayesian sandwich posterior distributions under four
different priors of the form $\pi(\beta, B) = \pi(\beta)\pi(B)$, based on two choices for each of $\pi(\beta)$ and $\pi(B)$.
The priors for $\beta$ included the (improper) uniform prior on $\mathbb{R}^2$, and an informative $N((1, 1)^T, nA^{-1})$
prior distribution. This latter prior, weakly centered around the correct values, represents accurate
but weak information about $\beta$ that someone may have: The matrix $A = \sum_i \mathbf{x}_i\mathbf{x}_i^T$ (here taken with
positive sign) is the information for $\beta$ from $n$ observations, and so $A/n$ represents the information
equivalent of one observation.
                      n = 10                           n = 500
                      $\pi(\beta)$                     $\pi(\beta)$
$\pi(B)$     informative    uniform           informative    uniform
Jeffreys     0.95 (2.86)    0.87 (4.80)       0.94 (0.74)    0.94 (0.76)
plug-in      0.69 (2.14)    0.65 (2.63)       0.93 (0.73)    0.93 (0.74)

Table 1: Coverage rates and average interval widths (in parentheses) of 10,000 nominal 95% confidence
intervals based on the four procedures. The standard (non-Bayesian) sandwich procedure
corresponds closely to the uniform/plug-in prior combination.



The priors for $B$ included Jeffreys' prior and a point-mass prior on $\hat{B}$, the plug-in estimate of $B$.
We note that the uniform/plug-in combination of priors leads to a $N_p(\hat\beta, nA^{-1}\hat{B}A^{-1})$ posterior
distribution for $\beta$. This posterior was referred to as the "artificial posterior" by Müller (2011), who
compared the risk of the resulting estimator to the risk of the Bayes estimator from the working
model.

For each sample size we simulated 10,000 datasets from the heteroscedastic regression distribution
given above, and obtained 95% posterior confidence intervals for the slope $\beta_2$ based on each
of the four priors. Empirical coverage probabilities and average interval widths are given in Table
1. For each dataset we also obtained a Wald-type interval for $\beta_2$ based on the plug-in sandwich
variance estimate (the usual sandwich confidence interval), but it performed nearly identically to
the estimator based on the uniform/plug-in prior, so we do not report these results separately.

For $n = 10$, both plug-in procedures perform very poorly in terms of coverage. This seems
primarily due to underestimation of $B$, resulting in confidence intervals that are shorter than
are required to attain 95% coverage. In contrast, the procedures using Jeffreys' prior both take
uncertainty in $B$ into account, and provide coverage rates closer to the nominal value. However,
the absence of any prior information about $\beta$ (uniform $\pi(\beta)$) leads to interval widths that are
quite high as compared to those obtained with some prior information (informative $\pi(\beta)$), as we
would expect: Accurate prior information about $\beta$ leads to more precise inference. For $n = 500$, all
sandwich-based procedures performed similarly, reflecting the asymptotic correctness of sandwich-based
confidence intervals in general. This is in contrast to the 95% nominal posterior confidence
intervals based on the (uncorrected) misspecified regression model. For a sample size of $n = 500$
and under the informative prior described above, these 95% posterior confidence intervals had a
coverage rate of only 68%.
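The small-sample behaviour of the usual plug-in sandwich interval can be explored with a short Monte Carlo sketch. This is not the authors' simulation code. It assumes the data-generating mechanism $y_i | x_i \sim N(1 + x_i, (1 + x_i)^2)$ with exponential(1) covariates, uses far fewer replications than the 10,000 reported in the note, and uses the closed form of the slope entry of $nA^{-1}\hat{B}A^{-1}$ for a simple regression with intercept. The function names are hypothetical.

```python
import math
import random

def slope_and_sandwich_se(xs, ys):
    """Least squares slope and its plug-in sandwich standard error."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    # Slope entry of n A^{-1} B_hat A^{-1}: sum_i (x_i - xbar)^2 e_i^2 / sxx^2.
    meat = sum(((x - xbar) * (y - intercept - slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, math.sqrt(meat) / sxx

def coverage(n, reps, seed=0, true_slope=1.0):
    """Fraction of nominal 95% Wald intervals that contain the true slope."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.expovariate(1.0) for _ in range(n)]
        ys = [rng.gauss(1.0 + true_slope * x, 1.0 + true_slope * x) for x in xs]
        slope, se = slope_and_sandwich_se(xs, ys)
        hits += abs(slope - true_slope) <= 1.96 * se
    return hits / reps

cov_small = coverage(n=10, reps=2000, seed=6)    # small-sample coverage
cov_large = coverage(n=500, reps=500, seed=7)    # large-sample coverage
```

Comparing `cov_small` with `cov_large` gives a rough view of how far the plug-in interval falls short of its nominal level at $n = 10$ under these assumptions. The exact values depend on the seeds and on the assumed data-generating mechanism.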
5 Discussion



Bayesian inference typically proceeds via the formulation of a sampling model for the data $X$ and a
prior distribution over the sampling model. To guard against model misspecification, one approach
is to make the model large enough to ensure that it contains the distribution that generated the
data. However, such a large model can lead to difficulties in prior specification and posterior
calculation. Such difficulties can often be avoided when interest is limited to a simple low-dimensional
parameter $\theta$. In such cases there often exists a statistic $t(X)$ or pivotal quantity $s(X, \theta)$ whose
sampling distributions are robust to model misspecification and from which a likelihood can be
constructed. In this note, we have suggested using the asymptotic "sandwich" distribution of the
MLE to construct a likelihood, and have illustrated via simulation how Bayesian confidence intervals
based on this likelihood provide improved performance over the standard non-Bayesian procedure.
Other authors have used similar ideas previously: In a testing context, Johnson (2005) shows how
modeling the distribution of test statistics, rather than the individual observations, can lead to
great simplifications in the calculation of Bayes factors (see also Wakefield (2009)). In a semiparametric
estimation setting, Hoff (2007) proposes Bayesian inference via a marginal likelihood that
depends only on the parameter of interest and not an infinite-dimensional nuisance parameter.
Approaches such as these suggest that simple, robust Bayesian inference can be obtained by restricting
attention to only those aspects of the data for which confident probability statements can be made.